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POPULATION GENETICS OF NEUTRAL MUTATIONS IN 
EXPONENTIALLY GROWING CANCER CELL POPULATIONS 

By Rick Durrett^ 
Duke University 

In order to analyze data from cancer genome sequencing projects, 
we need to be able to distinguish causative, or "driver," mutations 
from "passenger" mutations that have no selective effect. Toward this 
end, we prove results concerning the frequency of neutural mutations 
in exponentially growing multitype branching processes that have 
been widely used in cancer modeling. Our results yield a simple new 
population genetics result for the site frequency spectrum of a sample 
from an exponentially growing population. 

1. Introduction. It is widely accepted that cancers result from an accu- 
mulation of mutations that increase the fitness of tumor cells compared to 
the cells that surround them. A number of studies [Sjoblom et al. (2006), 
Wood et al. (2007), Parsons et al. (2008), The Cancer Genome Atlas (2008) 
and Jones et al. (2008, 2010)] have sequenced the genomes of tumors in or- 
der to find the causative or "driver" mutations. However, due to the large 
number of genes being sequenced, one also finds a large number of "passen- 
ger" mutations that are genetically neutral and hence have no role in the 
disease. 

To explain the issues involved in distinguishing the two types of mutations, 
it is useful to take a look at a data set. Wood et al. (2007) did a "discovery" 
screen in which 18,191 genes were sequenced in 11 colorectal cancers, and 
then a "validation" screen in which the top candidates were sequenced in 
96 additional tumors. The 18 genes that were mutated five or more times 
mutated in the discovery screen are given in Table 1. Here NS is short for 
nonsynonymous mutation, a nucleotide substitution that changes the amino 
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Table 1 





Colorectal 


cancer data from 


Wood et al. (2007) 








NS mutations 


Passenger probability 




(jr6I16 


Discovery 


Validation 


External 


SNP 


NS/S 


APC 


171 


138 


0.00 


0.00 


0.00 


KRAS 


79 


62 


0.00 


0.00 


0.00 


TP53 


79 


61 


0.00 


0.00 


0.00 


PIK3CA 


28 


23 


0.00 


0.00 


0.00 


FBXW7 


14 


9 


0.00 


0.00 


0.00 


EPHA3 


10 


6 


0.00 


0.00 


0.00 


TCF7L2 


10 


7 


0.00 


0.00 


0.01 


ADAMTSL3 


9 


5 


0.00 


0.00 


0.03 


NAV3 


8 


3 


0.00 


0.01 


0.D4 


GUCY1A2 


7 


4 


0.00 


0.00 


0.01 


MAP2K7 


6 


3 


0.00 


0.00 


0.02 


PRKDl 


5 


3 


0.00 


0.00 


0.39 


MMP2 


5 


2 


0.00 


0.02 


0.61 


SEC8L1 


5 


2 


0.00 


0.03 


0.63 


GNAS 


5 


2 


0.00 


0.04 


0.67 


ADAMTS18 


5 


2 


0.00 


0.07 


0.82 


RET 


5 


2 


0.01 


0.17 


0.89 


TNN 


5 





0.00 


0.11 


0.81 


acid in the corresponding 


protein. The top four genes in 


the hst are 


weU 


known to be 


associated with cancer. 









• Adenomatous polyposis coli (APC) is a tumor suppressor gene. That is, 
when both copies of the gene are knocked out in a ceh, uncontrohed growth 
results. It is widely accepted that the first stages of colon cancer are the 
loss of both copies of the APC gene from some cell, see, e.g.. Figure 4 in 
Luebeck and Moolgavkar (2002). 

• Kras is an oncogene, i.e., one which causes trouble when a mutation in- 
creases its expression level. Once Kras is turned on it recruits and activates 
proteins necessary for the propagation of growth factors. 

• TP53 which produces the protein p53 (named for its 53 kiloDalton size) is 
loved by those who study "complex networks," since it is known to be im- 
portant and appears with very high degree in protein interaction networks. 
p53 regulates the cell cycle and has been called the "master watchman" 
referring to its role in conserving stability by preventing genome mutation. 

• The protein kinase PIK3CA is not as famous as the other three genes 
(e.g., it does not yet have its own Wikipedia page) but it is known to be 
associated with breast cancer. In a study of eight ovarian cancer tumors 
in Jones et al. (2010), an G mutation was found at base 180,434,779 
on chromosome 3 in six tumors. 
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The next three genes on the hst with the unromantic names FBXW7, 
EPHA3, and TCF7L2 are all either known to be implicated in cancer or 
are likely suspects because of the genetic pathways they are involved in. Use 
google if you want to learn more about them. 

The methodology that Wood et al. (2007) used for assessing passenger 
probabilities is explained in detail in Parmigiani et al. (2007). In princi- 
ple this is straightforward: one calculates the probability that the observed 
number of mutations would be seen if all mutations were neutral. The first 
problem is to estimate the neutral mutation rate. In the column labeled "ex- 
ternal" this estimate comes from experimentally observed rates, while in the 
column labeled "SNP" they used the mutations observed in the study, with 
the genes declared to be under selection excluded. The estimation problem 
is made more complicated by the fact that DNA mutation rates are context 
dependent. The two nucleotides in what geneticists call a CpG (the p refers 
to the phosphodiester bond between the adjacent cytosine and the guanine 
nucleotides) each mutate at roughly 10 times the rate of a thymine. 

The third method for estimating passenger probabilities, inspired by pop- 
ulation genetics, is to look at the ratio of nonsynonymous to synonymous 
mutations after these numbers have been scaled by dividing by the number 
of opportunities for the two types of mutations. While the top dozen genes 
show strong signals of not being neutral, as one moves down the list the 
situation becomes less clear, and the probabilities reported in the last three 
columns sometimes give conflicting messages. The passenger probabilities in 
the last column are in most cases higher and in some cases such as NAV3 
and tthe last three genes in the table are radically different. My personal 
feeling is that in this context the NS/S test does not have enough mutations 
to give it power to detect selection, but perhaps it is the other two methods 
that are being fooled. 

To investigate the number and frequency of neutral mutations observed 
in cancer sequencing studies, we will use a well-studied framework in which 
an exponentially growing cancer cell population is modeled as a multi-type 
branching process. Cells of type i>0 give birth at rate and die at rate 6j, 
where the growth rate Aj = Oj — ftj > 0. Thinking of cancer we will restrict 
our attention to the case in which i — ?> Aj is increasing. To take care of 
mutations, we suppose that individuals of type i also give birth at rate Uj+i 
to individuals of type i + 1 that have one more mutation. This is slightly 
different from the approach of having mutations with probability Uj+i at 
birth, which translates into a mutation rate of OjUi+i, and this must be kept 
in mind when comparing with other results. 

Let Tfc be the time of the first type k mutation and let be the time of the 
first type k mutation that gives rise to a family that lives forever. Following 
up on initial studies by Iwasa, Nowak and Michor (2006), and Haeno, Iwasa 
and Michor (2007), Durrett and Moseley (2010) have obtained results for 



4 



R. DURRETT 



and limit theorems for the growth of Zk{t), the number of type A:'s at time t. 
These authors did not consider ak, but the extension is trivial: each type k 
mutation gives rise to a family that lives forever with probability X^/a^, so 
all we have to do is to replace in the limit theorem for by u^Xk/ak. 

1.1. Wave results. To begin to understand the behavior of neutral 
mutations in our cancer model, we first consider those that occur to type 
O's, which are a branching process Z{)[t) in which individuals give birth at 
rate oq and die at rate 5o < oo- It is well-known, see O'Connell (1993), that if 
we condition ZQ{t) to not die out, and let lb(*) be the number of individuals 
at time t whose families do not die out, then lo(t) is a Yule process in 
which births occur at rate 7 = Ao/oq. Our first problem is to investigate the 
population site frequency spectrum, 

(1) F{x) = lim Ft{x), 

where Ft{x) is the expected number of neutral "passenger" mutations present 
in more than a fraction x of the individuals at time t. To begin to compute 
F{x), we note that 

(2) Yo{t)/Zo{t) in probability, 

since each of the ZQ{t) individuals at time t has a probability 7 of starting 
a family that does not die out, and the events are independent for different 
individuals. 

It follows from (2) that it is enough to investigate the frequencies of neu- 
tral mutations within Yq. If we take the viewpoint of the infinite alleles 
model, where each mutation is to a type not seen before, then results can 
be obtained from Durrett and Schweinsberg's (2005) study of a gene dupli- 
cation model. In their system there is initially a single individual of type 1. 
No individual dies and each individual independently gives birth to a new 
individual at rate 1. When a new individual is born it has the same type as 
its parent with probability 1 — r and with probability r is a new type which 
is different from all previously observed types. 

Let T/v be the first time there are N individuals and let Fs^n be the 
number of families of size > S" at time T/v • Omitting the precise error bounds 
given in Theorem 1.3 of Durrett and Schweinsberg (2005), that result says 

(3) Fs,N ~ rT NS'^^^^-''^ for 1< 5 < N^-'' . 

The upper cutoff on S is needed for the result to hold. When S » N^^"^ , 
EFs.N decays exponentially fast. 

As mentioned above, the last conclusion gives a result for a branching 
process with mutations according to the infinite alleles model, a subject first 
investigated by Griffiths and Pakes (1988). To study DNA sequence data, we 
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are more interested in the frequencies of individual mutations. Using ideas 
from Durrett and Schweinsberg (2004) it is easy to show: 

Theorem 1. If passenger mutations occur at rate v then F{x) = v j^x. 

This theorem describes the population site frequency spectrum. As in 
Section 1.5 of Durrett (2008), this can be used to derive the site frequency 
spectrum for a sample of size n. Let ?7n,m be the number of sites in a sample 
of size n where m individuals in the sample have the mutant nucleotide. If 
one considers the Moran model in a population of constant size iV then 

(4) Er]n m = for 1 <m < n. 

m 

Using Theorem 1 now, we get a new result concerning the population genet- 
ics of exponentially growing populations. Here we are considering a Moran 
model in an exponentially growing population, see, e.g.. Section 4.2 of Dur- 
rett (2008), rather than a branching process. 

Theorem 2. Suppose that the mutation rate is v and the population 
size t units before the present is N(t) = Ne~^^ then as N ^ oo 

nv 1 



(5) ^7?. 



2 <m <n, 



7 m(m — 1) ' 

— •log(A^7), m = l, 

7 



nv 



where ajy ~ bj^ means aj\[/bN 1- 

To explain the result for m = 1, we note that, as Slatkin and Hudson 
(1991) observed, genealogies in exponentially growing population tend to be 
star-shaped. The time required for io(^) to reach size A^7 (and hence roughly 
the time for Z(){t) to reach size N) is ~ (I/7) log(A^7), so the number of 
mutations on our n lineages is roughly nu times this. Note that, (i) for a 
fixed sample size, Er]n,m^ 2 < m < n are bounded independent of the final 
population size, and (ii) in contrast to (4), the sample size replaces the 
population size in formula (5). 

The result in Theorem 2 is considerably simpler than previous formulas. 
Let L{t) be the number of lineages t units of time before the present. For 
2 < A; < n let Tfc = sup{t : L{t) > k} be the first time at which the number of 
lineages is reduced to — 1, and let Sk = Tk — T^+i where T„+i = 0. Griffiths 
and Tavare (1998) have shown that under some mild assumptions (coalescent 
times have continuous distributions, only two lineages coalesce at once, all 
coalescence events have equal probability, Poisson process of mutations) the 
probability that a segregating site has b mutant bases is 

_ (n - 6 - 1)!(6 - 1)! EL2 - 1) (lZl)ES, 
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m, the number of mdividuais in sample 



Fig. 1. Simulated site frequency spectrum when v — 7, sample size n — 10, and population 
size Af = 100,000. 



To apply this result to the coalescent with population size N{t) = Ne~^^ , 
one needs formulas for ESk- See for example (52) in Polanski, Bobrowski, 
and Kimmel (2003). However, these formulas are complicated and difficult 
to evaluate numerically, since they involve large terms of alternating size. 
To connect (6) with the result in Theorem 2, we write 

q^^ = l_ Efc=2 - k)ESk 

Equation (31) below will show that ESn ~ log while for 2 <k <n, ESk = 
0(1) so we have 1 — qn,i = 0(1/ log A^) in agreement with (5). 

To check (5) Yifei Chen, a participant in a summer REU associated with 
Duke's math biology Research Training Grant, performed simulations. Fig- 
ure 1 gives results for the average of 100 simulations with the indicated 
parameters. The agreement is almost perfect for m>2 but the formula con- 
siderably over estimates the number of singletons with (5), predicting 69.07 
versus an observed value of about 40. Given the approximations used in the 
proof of Theorem 2 in Section 2 for the case m = 1, this is not surprising. 
The next result derives a much better result for Eijn^i which gives a value 
of 36.66. See (27) for details of the numerical calculation. 
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Theorem 3. 



E 



n k 



7 



k=l 



n-\-k n + k —1 



Here w means simply that this is an approximation which is better for 
finite N. As N ^ oo the right-hand side ~ (nz^/7) log(A^7) the answer in 
Theorem 2. 

The results for Ei]n,m are useful for population genetics, but are not really 
relevant to cancer modeling. To investigate genetic diversity in the expo- 
nentially growing population of humans, you would sequence the DNA of a 
sample of individuals from the population. However, in the study of cancer 
each patient has their own exponentially growing cell population, so it is 
more interesting to have the information provided by Theorem 1 about the 
fraction of cells in the population with a given mutation. 

Numerical example. To illustrate the use of Theorem 1 suppose 7 = 
'^o/'^o = 0.01 and v = 10~^. In support of the numbers we note that Bozic 
et al. (2010) estimate that the selective advantage provided by a typical 
cancer driver mutation is 0.004 zb 0.0004. As for the second, if the per nu- 
cleotide mutation rate is 10"^ and there are 1000 nucleotides in a gene then 
a mutation rate of 10~^ per gene results. In this case Theorem 1 predicts 
if we focus only on one gene then the expected number of mutations with 
frequency > 0.1 is 



so, to a good first approximation, no particular neutral mutation occurs with 
an appreciable frequency. Of course, if we are sequencing 20,000 genes then 
there will be a few hundred passenger mutations seen in a given individual. 
On the other hand there will be very few specific neutral mutations that will 
appear multiple times in the sample. 

1.2. Wave 1 results. We refer to the collection of type k individuals as 
wave k. In order to analyze the cancer data, we also need results for neutral 
mutations in waves > of the multitype branching process. We begin by 
recalling results from Durrett and Moseley (2010) for type 1 individuals in 
the process with Zq{Q) = 1 when we condition the event that the type O's 
do not die out. Let ai be the time of the first "successful" type 1 mutation 
that gives rise to family that does not die out. Then o"i has median 



(7) 



F(O.l) = 10~^+2+^ = 0.01 





P{ai>s\,^ + x/\Q)^{l + e') 



-1 
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For (8) see (7) in Durrett and Moseley (2010) and drop the 1 inside the 
logarithm. The second result follows from the reasoning for (6) there. 

In investigating the growth of type I's, it is convenient mathematically to 
assume that Z*{t) = Foe^o* for t € ( — cxD,oo) and to let Z^{t) be the number 
of type k's at time t in this system. Here the star is to remind us that we 
have extended Zq to negative times. The probability of a mutation to type 
1 at times t < is < VqUi/Xq. In the concrete example ui/Aq = 10"^, so this 
is likely to have no effect. The last calculation omits two details that almost 
cancel out. When we condition on survival of the type O's, EVq = ao/Ao, 
but the probability a type 1 mutation survives for all time is Ai/ai. Since 
ao ~ cii we are too low by a factor of Ai/Aq = 2. 

Durrett and Moseley (2010) have shown: 

Theorem 4. If we regard Vq as a fixed constant then as t ^ oo, e~'^^* x 
Z^{t) —7- Vi where V\ is the sum of the points in a Poisson process with mean 
measure /i(x,oo) = Cfj,^iUiVox~" with a = Aq/Ai and 

(10) ,^,,.1(^1) V). 

The Laplace transform E{e~^^^\Vo) = exp(— c/i^itiiVb^") where Ch^i = 
c^^ir(l — q). IfVo is exponential(Ao/ao) then 

(11) Eexp{-eVi) = (1 + c/,,ini(ao/Ao)r)-^ 

Here, and in what follows, constants like c^^i, Ch,i, and ce,i will depend on 
the branching process parameters aj and bi, but not on the mutation rates 
Ui. The constant here is equal to, but written differently from, the one in 
Durrett and Moseley 

"^hA = ^ f 'r(l + a)r(l -a) = -^(^) "ar(a)r(l - a). 

\o\XiJ aiXo\XiJ 

To prepare for later results note that the formula for the Laplace transform 
shows that conditional on Vq, Vi has a one sided stable distribution with 
index a. 

The point process in Theorem 4 describes the contributions of the suc- 
cessful type 1 mutations to Zi{t). The first such mutation occurs at time cJi, 
which has median Sy2- The derivation of Theorem 4 is based on the obser- 
vation that a mutation at time s will grow to size VFi by time t, 
where Wi has distribution 

6i ^ Ai ■ 1 / \ / \ 

Wi =d — Oo H exponential(Ai/ai) 

oi ai 

—X^ (s—s^ ) 

and hence make a contribution of e to the limit Vi . Thus we expect 

that most of the mutations that make a significant contribution will come 
within a time 0(1/Ai) of sJ/2- 
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The complicated constants in Theorem 4 can be simphfied if we instead 
look at the limit 

e-Ai(*-«i/2)^*(t) ^ =^ Vi exp(AisJ/2). 
Using the definition of sj^g ™ (8) ^'^id recalling q = Aq/Ai we see that 

exp(AiSw2) = (^^^ . o\ 

and hence using (11) 



(12) ^exp(-6'yi)= l + ar(a)r(l-a 



1 

The combination of Gamma functions is easy to evaluate, since Euler's re- 
flection function implies that 

(13) ar(a)r(l-a) = — > 1. 

sin(7ra) 

A second look at (12) shows that aiVi/\i has a distribution that only de- 
pends on a. For comparison, note that if Vq is exponential(Ao/ao) then 
flo^o/-^o is exponential 1). 

Using results for one-sided stable laws, Durrett et al. (2011) were able to 
prove results about the genetic diversity of wave 1. Define Simpson's index 
to be the limiting probability two randomly chosen individuals in wave 1 are 
descended from the same type 1 mutation. In symbols, it is the p = 2 case 
of the following definition 

i=l ^1 

where Xi > X2 > ■ ■ ■ are points in the Poisson process and Vi is the sum. The 
result for the mean, which comes from a result of Fuchs, Joffe and Teugels 
(2001), is much simpler than one could reasonably expect. 

Theorem 5. ER2 = 1 — a. 

After this paper was written Jason Schweinsberg explained to me that 
the points Yi = Xi/Vi have the Poisson-Dirichlet distribution PD(a,0), so 
Theorem 5 follows from (3.6) in Pitman (2006). For our purposes it is easier 
to refer to (6) in Pitman and Yor (1997) where it is shown that 



00 „i 

r(a)r(i-a)yo 
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Taking /(x) = xP we find that Rp = Y,i ^V^k ^as 

ERri — E \ YJ' = -— — ^ — - . 

P ^ T{l-a)T{p) 



Using formulas in Logan et al. (1973) one can derive results for the dis- 

— 1/2 

tribution of i?2 • Work of Darling (1952) leads to information about the 
distribution of the fraction in the largest clone Xi/Vi. In particular, 

Theorem 6. Vi/Xi has mean 1/(1 — a). 



Since l/x is convex, E{Xi/Vi) > l/E{Vi/Xi) = 1 - a. 

Theorems 5 and 6 suggest that if we are interested in understanding neu- 
tral mutations in say 90% of the population when wave 1 is dominant, then 
we can restrict our attention to the families generated by a small number of 
the most prolific type 1 mutants. (The number we need to consider will be 
large if a is close to 1.) The result in (7) suggests that we can ignore neutral 
mutations within the descendants of these type 1 mutations. Mutations that 
occur on the genealogies of the ith largest mutations will appear in all of 
their descendants and hence have frequency Xi/Vi. As remarked above (and 
explained in more detail in Section 3), the genealogies of the most prolific 
type 1 mutants will be approximately star-like so they will mostly have dif- 
ferent mutations. Note that here, in contrast to the reasoning that led to 
(21) there are several individuals founding different subpopulations whose 
genealogies have collected neutral mutations. 

1.3. Wave k results. Once Theorem 4 was established it was straightfor- 
ward to extend the result by induction. Let Ok = Xk-i/^k, 

(14) c^,fc = — (t^I r(afc) and Ch,k = Cf,^k^{l - at). 
Let CQfl = ao/Ao, /^io = 1 and inductively define for A; > 1 

(15) ce,k = ce,k^iclf'-\ 

(16) ^, = ^,_,nJ"A-i=fJ^;°/^-\ 
Durrett and Moseley (2010) have shown: 

Theorem 7. Suppose ZQ{t) = V^e'^'o* for t G (-00,00) where Vq is 
exponential(Ao /oo) • 

e"^'=%*(t)^T4 a.s. 

Let 7"^"^ be the a-field generated by Z*{t), j <k -1, t>0. {Vk\Tl^^) is 
the sum of the points in a Poisson process with mean measure /.i(x, 00) = 
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and hence 

(17) i?e-^^^=(l + c,,fc/ifc^^°/^^)-i. 

Using Theorem 7 it is easy to analyze t^+i, the waiting time for the first 
type k + 1. Details of the derivations of (18) and (19) are given in Section 4. 
The median of t^+i is 

1 / A'**"^'^'" \ 1 1 

(18) t\+^ = ^ log — = IT log(Afc) - — log(ce,fc/Xfc+i) 

and as in the case of ri 

Again the result for the median s^J^ of time cjfc+i of the first mutation 
to type k + 1 with a family that does not die out can be found by replacing 

Uk+i by Uk+iXk+i/ck+i- 

Formula (18), due to Durrett and Moseley (2010), is not very transparent 
due to the complicated constants. We will obtain a more intuitive result by 
looking at the difference s^J^ — Sy2- After some algebra, hidden away in 
Section 4, we have 

(19) ,;«-,f,, = ±,„,(-|^) -^,„,Kr(„or(i-^ 

Neutral mutations. Returning to our main topic, it follows from the first 
conclusion in Theorem 7 that the results of Theorems 5 and 6 hold for wave 
k when a is replaced by = Xk-i/Xk- Suppose for simplicity that k = 2. 
In the concrete example 02 = 2/3, so ER2 = 1/3 and again there will be a 
small number of type 2 mutations that occur at times close to s^i^ that are 
responsible for 90% of the population. If we let xi > X2 > • • • be the fractions 
of the type 1 population that result from the most prolific type 1 mutants, 
then the jth most prolific type 2 mutation will trace its lineage back to 
the ith most prolific type 1 mutation with probability Xj. All of the type 2 
mutants who trace their ancestry back to the same type 1 mutant will have 
lineages that coalesce at times near sjy2- Working backwards from that time 
the genealogy of the most prolific type 1 mutations will be star like. At this 
point a picture is worth a hundred words, see Figure 2. 

1.4. Relationship to Bozic et al. (2010). The inspiration for this investi- 
gation came from a paper by Bozic et al. (2010). Their model takes place in 
discrete time to facilitate simulation and their types are numbered starting 
from 1 rather than from 0. At each time step, a cell of type j > 1 either 
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Fig. 2. Genealogy of wave 2 individuals. Here 0.6, 0.25, and 0.1 are the frac- 
tions of the type 1 population derived from the three most prolific type 1 mutations. 
If these numbers look odd recall that in the example ER =1/2 for wave 1, while 
(0.6)^ + (0.25)^ + (0.1)2 ^ 0.4325. 

divides into two cells, which occurs with probability hj, or dies with proba- 
bility dj where dj = {1 — sy /2 and bj = 1 — dj. It is unfortunate that their 
birth probability bj is our death rate for type j cells. We will not resolve 
this conflict because but we want to preserve their notation make it easy to 
compare with the results in the paper. 

In addition, at every division, the new daughter cell can acquire an ad- 
ditional driver mutation with probability u, or a passenger mutation with 
probability u. They find the following result for the expectation of Mk, the 
number of passenger mutations in a tumor that has accumulated k driver 
mutations: 

(20) EMk = —log^logk. 

The derivation of this formula suffers from two errors due to a fundamental 
misconception, and loses accuracy because of some dubious arithmetic. The 
first error is to claim that (see Section 5 of their supplementary materials) 

(21) EMu = ^Eak, 

where T is the average time between cell divisions. In essence (21) asserts 
that the passenger mutations in the population are exactly those that have 
appeared along the genealogy of the cell with the first type k mutation that 
gives rise to a family that lives forever. However as Theorems 4 and 7 show, 
this is wrong because after the initial wave more than one mutation makes 
a significant contribution to the size of the type k population. 

The second erroneous ingredient is (S5) in their supplementary materials. 
In quoting that result below we have dropped the 1+ inside the log in their 



GENETICS OF EXPONENTIALLY GROWING POPULATIONS 13 



formula, since it disappears in their later calculations and this makes their 
result easier to relate to ours. 

r,t ^ riog[(l - g,)/(n5,(l - - 1/(6,(2 - u)))] 
(22) E{<y,^,-a,) = log[6,(2-n)] ' 

where qj is probability that a type j mutation dies out. By considering what 
happens on the first step: 



(23) qj f« dj + bjqj and hence q^ k, ^ 1 — 2js, 

Oj 1 + J s 



where the last approximation assumes that s is small. 

Before we start to compare results, recall that Bozic et al. (2010) number 
their waves starting with 1 while our numbers start at 0. When the differ- 
ences in notation are taken into account (8) agrees with the j = 1 case of 
(22). The death and birth probabilities in the model of Bozic et al. (2010) 
are di = {1 - s)/2 and bi = 1 - di = {1 + s)/2, so log(26i) log(l + s) s. 
qj « (1 — js) /(I + js) ~ 1 — 2js. Taking into account the fact that mutations 
occur only in the new daughter cell at birth, we have ui = biu, so when j = 1 
(22) becomes 

1 / .2 
E{a2 - cJi) « - log 



ui ■ 2s 



Setting Xj = {j + l)s, and = in our continuous time branching process, 
we have ai/oo ~ 1 and this agrees with (8). 

Numerical example. To match a choice of parameters studied in 
Bozic et al. (2010), we wiU take u = 10"^ and s = 0.01, so Ui = biU ?a 5 x 10^^, 
and 

1 / 10""^ \ 

s|,„ ^ log = 100 log(lOOO) = 690.77. 

0.01 X 10-6 .0.027 ^ 

Note that by (9) the fluctuations in ai are of order I/Aq = 100. 

To connect with reality, we note that for colon cancer the average time 
between cell divisions is T = 4 days, so 690.77 translates into 7.57 years. In 
contrast, Bozic et al. (2010) compute a waiting time of 8.3 years on page 
18,546. This difference is due to the fact that the formula they use [(1) on 
the cited page] employs the approximation 1/2 ~ 1. 

Turning to the later waves, we note that: 

(i) the first "main" term in (19) corresponds to the answer in (22). 

(ii) by (13), akT{ak)T{l — a^) = Tra^/ sin(7rafc) > 1, so the "correction" 
term not present in (22) is < 0, which is consistent with the fact that the 
heuristic leading to (22) considers only the first successful mutation. 
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Table 2 

Comparison of expected waiting times from (19) and (22). The numbers m parentheses 
are the answers converted into years using T = 4: as the average number of days between 

cell divisions 







Main 


Corr. 




From (19) 


From (22) 


^1/2 
„2 

■^1/2 
„3 

■^1/2 


*l/2 
*l/2 


690.77 
394.41 
280.36 




45.15 
44.15 


^1/2 
„2 

^1/2 
„3 

^1/2 


690.77 (7.57) 
1040.03 (11.39) 
1276.24 (13.98) 


550.87 (6.04) 
895.39 (9.81) 
1149.79 (12.60) 



To obtain some insight into the relative sizes of the "main" and the 
"correction" terms in (19), we will consider our concrete example in which 
Ai = (i + l)s and a, = ftj+i ^ 1/2, so for z > 1 

i 1 , ( {i + lfs \ 1 ^ [ -Kaj \ 

'1/2 '^1^- {i + l)s^''\u,+^{i + 2)) is °Hsin(vra,)| 

Taking s = 0.01, u = 10"^ and = 5 X 10 ^ leads to the results given in 
Table 2. 

The values in the last column differ from the sum of the values in the first 
column because Bozic et al. (2010) indulge in some dubious arithmetic to 
go from their formula 

1 / 2fs \ 
E{aj+i-aj) = — log . , 

to their final result 

First they use the approximation + 1) ~ 1 and then ^'jZi ~ Iq- ^^e 
first row of the table this means that their formula underestimates the right 
answer by 20%. Bozic et al. (2010) tout the excellent agreement between 
their formula and simulations given in their Figure S2. However, a closer 
look at the graph reveals that while their formula underestimates simulation 
results, our answers agree with them almost exactly. 

2. Proofs for wave 0. 

Proof of Theorem 1. Dropping the subscript for convenience, recall 
that Y{t) is defined to be the number of individuals in the branching process 
Z(t) with an infinite line of descent and that Y{t) is a Yule process with 
birth rate 7 = Ao/ao- For j >1 let Tj = min{t ■.Yt=j} and notice that Ti = 0. 
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Since the j individuals at time Tj start independent copies , . . . of Y, 
well known results for the Yule process imply 

hm e-^'Y\s) = C^, 

s— >oo 

where the are independent exponential mean 1 (here time s in Y^ corre- 
sponds to time Tj + s in the original process) . From the limit theorem for the 
we see that for j > 2 the limiting fraction of the population descended 
from individual i at time Tj is 

which as some of you know has a beta(l, j — 1) distribution with density 

(j-l)(l-x)^-2. 

To prepare for the simulation algorithm it is useful to give an explicit 
proof of this fact. Note that 

((ei,...,Ci)iei + ---+o=i) 

is uniform over all nonnegative vectors that sum to t, so (ri, . . . ,rj) is uni- 
formly distributed over the nonnegative vectors that sum to 1. Now the joint 
distribution of the rj can be generated by letting Ui, . . . , Uj-i be uniform on 
[0, 1], J7(i) < < • . • < C/O-i) be the order statistics, and n = U^^ - U^''^'^ 
where ^7^'^^ = and U'^^^ = 1. From this and symmetry, we see that 

P{ri >x) = P{rj >x)= P{Ui < X for 1 < i < j - 1) = (1 - xy-^ 

and differentiating gives the density. 

If the neutral mutation rate is v then on \Tj,Tjj^i) mutations occur to 
individuals in Y at rate vj, while births occur at rate 7j, so the number of 
mutations Nj in this time interval has a shifted geometric distribution with 
success probability 7/(7 + 2^), i.e., 

(24) P{N, = k)={-^^'-L- forfe = 0,l,2,.... 



The Nj are i.i.d. with mean 



7 7 

Thus the expected number of neutral mutations that are present at frequency 
larger than x is 

00 

^^(l-x)-^ = ^. 

The J = 1 term corresponds to mutations in [Ti,T2) which will be present 
in the entire population. □ 
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Simulation algorithm. The proof of the last result leads to a useful simu- 
lation algorithm. Suppose we have worked our way up to time Tj with j >1 
and the limiting fractions of the descendants of the j individuals at this time 
correspond to the sizes of the intervals 



where the Uj^i, I <i < j, are the order statistics of a sample of j — 1 inde- 
pendent uniforms. 

To take care of mutations in [Tj,Tj^i), we generate a number of muta- 
tions Nj with a shifted geometric distribution given in (24) and associate 
each mutations with an interval {Uj^i-i,Uj^i) with i chosen at random from 



To produce the subdivision at time Tj+i, let V be an independent uniform, 



Note that the interval to be split is not chosen at random but according to 
its length. The simplest explanation of why this is true is that it is needed to 
have the new point added be uniform on (0, 1). For a detailed explanation, 
see Theorem 1.8 of Durrett (2008). 

When we have worked our way down to Tj with j = we stop. To 
find the properties of a sample of size n, we choose points Xi, . . . ,Xn in- 
dependently and uniform on (0,1). For each k a mutation associated with 
{Uk,i-i, Uk,i) appears in all of the individual G {Uk,i-i, Uk,i)- 

Proof of Theorem 2. We begin with a calculus fact, that is, easy for 
readers who can remember the definition of the beta distribution. The rest 
of us can simply integrate by parts. 

Lemma 2.1. If a and b are nonnegative integers 



Differentiating the distribution function from Theorem 1 gives the density 
V l^x^ . We have removed the atom at 1 since those mutations will be present 
in every individual and we are supposing the sample size n> m the number 
of times the mutation occurs in the sample. Conditioning on the frequency 
in the entire population, it follows that for m < 2 < n that 



= Ujfi <[/,■,!<•••< Ujj^i < Ujj = 1 



1,... 




(25) 





where we have used n<^ N and the second step requires m > 2. 
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When m = 1 the formula above gives Eijn^i = oo. To get a finite answer 
we note that Zt = n roughly when Yt = nj so the expected number that are 
present at frequency larger than x is 

7V7 



Differentiating (and multiplying by —1) changes the density from v j^x^ to 

(26) !l('J_(l_(l_a;)A^7)_ijV^(l_3.)A'7-l 

7 X 

Ignoring the constant v for the moment and noticing (^)x™'(l — x)""™ = 
nx(l — x)"~^ when m = 1 the contribution from the second term is 

n Nj{l - x)^^+"-2 dx = n- -^—^ < n 

Jo Nj + n-l 

and this term can be ignored. Changing variables x = y/N^ the first integral 
is 

.1 ^ 



I i(l_(l_^)A^7)(l_^)-l 

Jo X 



dx 



_(!_(!_ y/N^f^){l - y/N^T-^ dy. 

y 

To show that the above is ~ log(A^7) we let Kj\f — )■ 00 slowly and divide the 
integral into three regions [0,J^7v], [-fCAr,iV7/logA^], and [iV7/log A^, A^7]. 
Oustide the first interval, (1 — y/Nj)^'^ — ?> and outside the third, (1 — 
y/N'y)^~^ —7- 1 so we conclude that the above is 

i-N"// log N 1 

0{Kn)+ / -d2/ + 0(loglogiV). 

Jkn y 

As the simulation results cited in the introduction suggest, this approxima- 
tion is somewhat rough. □ 

Proof of Theorem 3. When a mutation that occurs on level j = k + l 
is associated with (Uj^i-i, Uj^i) it affects all members of the sample that land 
in that interval. By symmetry of the joint distribution of the interval lengths, 
we can suppose without loss of generality that i = 1. Think of the k break 
points Uj^i with 1 < i < j — 1 as red points and the n uniforms Xi,. . . , Xn 
as blue. The mutation will affect exactly one individual in the sample if as 
we look from left to right, the first point is blue and the second is red. By 
symmetry this has probability 

n k 
n + k n — 1 + k 
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Taking into account that the mean number of mutations per level is v/"f 
and summing gives desired formula. □ 

Evaluating the constant. Writing M for A^7, 
^ n k 1 / n — 1 



^ = ny — 

^-^n + k n — l + k ^-^n + k V n — l + k 

k=l k=l 



1 



n+M ^ M y ^ 

.n ^ n(n-l)^(— - 



7 ^-^ \n+ k — 1 n + k 

j=n+l k=i ^ 



1 



The second sum telescopes and has value 



—n(n — 1) ( — I ^ —(n — 1). 

If p is Euler's constant then the first sum is 

n ^ 

salog(n + M) + p- V-. 

If n = 10 and M = 1000 then we end up with 

(27) 10 • [6.9177 + 0.5772 - 2.929] - 9 = 36.66. 

3. Genealogies. A simple description and a useful mental picture of ge- 
nealogies in an exponentially growing population is provided by the following 
result of Kingman (1982). 

Theorem 8. // we run time at rate 1/N{s) then on the new time scale 
genealogies follow the standard coalescent in which there is coalescence at 
rate (2) when there are k lineages. 

When N{t) = Ne~^^ the time interval [0, (l/7)logiV) over which the 
model makes sense gets mapped by the time change to an interval of length 

1 /■{V7)logAf 1 iV- 1 1 

— / e^^dt = -— < -. 

N Jo N 

While Theorem 8 is useful conceptually, it is difficult to use for compu- 
tations because after the time change mutations occur at a time-dependent 
rate. Back on the original time scale, Griffiths and Tavare (1998) have shown 
that the joint density of the coalescent times (T^., . . . ,r„) for any A; > 2 is 
given by 

(28) p.,.(^„ . . . , t.) = n ^ exp [- l^^^ ^ ds^ , 
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where = tn+i < tn < ■ ■ ■ < tk- In particular when k = n and N{t) = Ne~'^* 

, X , ^ nin — 1) ^. / n(n — 1) , 

(29) p„(t„) = ^^e^*" ^"P(-^^^^' - 1) 

One can, in principle at least, find the marginal distribution of by 
integrating out the variables in (28). According to (5)-(8) in 

Polanski, Bobrowski, and Kimmel (2003) 

n 

Pk{tk) = '^Ajqj{tk) where 

j=k 

(30) 



and the coefficients Aj are given by A!^ = 1 

.k Y[e=k,e^j (2) t r ^ J ; / ■ / 

Aj = — -T— for k <n and k <j <n. 

'n.e=k,ej^ji{2) ~ (2)] 

We have said in principle earlier because the coefficients grow rapidly and 
have alternating signs, which to quote the authors: "makes the use of this 
result for samples of size n > 50 difficult." 

Fortunately, for our purposes (29) is enough. From its derivation and the 
inequality >1 — x we have 



PiTn> 



. ( nin — 1) , ^. , 

t) = exp^-^^(e-*-l) 



> 1 - !^i!i^e^*. 



The right-hand side is at time Un = (I/7) log(2A^7/n(n — 1)) so 

1 , / 2Af7 \ nin - 1) ^, , 

i?T„>-log ^ ^ - W e^'ds 

7 \n{n — 1) / 2iV7 



(31) 

1 

> - 

7 



n{n — 1) 



This is within 0(1) of the time (l/7)logA^ at which the model stops making 
sense, so it follows that the expected values of 5^ = — T^+i are 0(1) for 
2<k<n. 
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4. Proofs of the wave k formulas (18) and (19). Our next topic is the 
waiting time for the first type k + 1: 

P{Tk+i > t\Tt) = exp(^- ^ Zl{s) ds^ « expi-Uk+iVke^'^'/Xk). 

Taking expected value and using Theorem 7 

P{Tk+i > ml,) = (1 + c,,fc^fc(nfc+ie^*7Afc)^"/^*)-^ 
Using the definition of iJ,k+i the median t'y^ is defined by 

and solving gives 

1 / A''*"'''^*' \ 1 1 
*i/2^ = ^ — = ^ log(^fc) - ^ log(ce,fc/xfc+i) 

which is (18). As in the case of ti 

P{Tk+i > t^/V + ^/^o) « (1 + e")~'. 

Again the result for the median 5^/2^ time cifc+i of the first mutation 

to type k + 1 with a family that does not die out can be found by replacing 

Uk+i by lifc+iAfc+i/cfc+i. Using ^k+i = l^'ku'k+i'' ^1'°™ (^6) when we do this 
gives 

k+l 1 Xkttk + l \ 1 



(32) = -log _ _iog(c,,,^,). 

To simplify and to relate our result to (22), we will look at the difference 

fc+i k 1 1 AfcOfc+i \ 1 / Afc_iafc 
s^h. -Sl/2 = ^log ^ -1 log 



-^iog(4f^-^-i"^'^-^)> 

where in the second term we have used (15) and (16) to evaluate ce,fc/c6»,fc-i 
and Recalling the formula 

c/i,fc = — (^) r(afc)r(l-afc) with OA; = Afc_i/Afc 
o-k \M/ 

given in (14) we have 

which is (19). To see this note that the from the last term and the l/a^ 
from the Ch,k cancel with parts of the second term, and the (ak/Xk)"'' from 
the third ends up in the first. 
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